Workshop 1 - logistic regression

Regression

  • In the previous part, when we fitted a regression model, our “training data” were of the form \[ (y_i,{x}_i) \] where \(i\) indexes a particular data point.
  • For example, for each \(i\), we had:
    • \(y_i\) = time
    • \({x}_i\) = dist and climb
  • The “standard” linear model is designed for when the response variable, \(y_i\), (the thing we’re trying to predict/explain) is a continuous number, like time.

Generalised linear models

  • When the response is something other than a continuous number, we can turn to generalised linear models (GLMs).
  • For example, for each incident \(i\), we could have:

    • \(y_i\) = whether or not there was a casualty (0 = no, 1 = yes; or FALSE/TRUE)

    • \({x}_i\) = covariates about the incident (data zone, ACORN category, etc.)

  • The linear model from before doesn’t work here: \(y_i\) can’t be written as a linear combination of the covariates \(x_i\) (because \(y_i\) is either 0 or 1).
  • A sensible target for our model is \(P(y_i=1 | x_i)\).
  • We read this as “The probability that \(y_i=1\) given that the incident had covariates \(x_i\)”.

An example with the hills data

  • What if we hadn’t seen the exact time, but only whether or not it was more than one hour?
  • Let’s use dplyr to make the data that way:
library(dplyr)
MASS::hills |> mutate(more_than_60_min = time > 60)
  • … then get rid of time and store the result as a new table:
hills_new <- MASS::hills |> mutate(more_than_60_min = time > 60) |> select(-time)
hills_new
                 dist climb more_than_60_min
Greenmantle       2.5   650            FALSE
Carnethy          6.0  2500            FALSE
Craig Dunain      6.0   900            FALSE
Ben Rha           7.5   800            FALSE
Ben Lomond        8.0  3070             TRUE
Goatfell          8.0  2866             TRUE
Bens of Jura     16.0  7500             TRUE
Cairnpapple       6.0   800            FALSE
Scolty            5.0   800            FALSE
Traprain          6.0   650            FALSE
Lairig Ghru      28.0  2100             TRUE
Dollar            5.0  2000            FALSE
Lomonds           9.5  2200             TRUE
Cairn Table       6.0   500            FALSE
Eildon Two        4.5  1500            FALSE
Cairngorm        10.0  3000             TRUE
Seven Hills      14.0  2200             TRUE
Knock Hill        3.0   350             TRUE
Black Hill        4.5  1000            FALSE
Creag Beag        5.5   600            FALSE
Kildcon Hill      3.0   300            FALSE
Meall Ant-Suidhe  3.5  1500            FALSE
Half Ben Nevis    6.0  2200            FALSE
Cow Hill          2.0   900            FALSE
N Berwick Law     3.0   600            FALSE
Creag Dubh        4.0  2000            FALSE
Burnswark         6.0   800            FALSE
Largo Law         5.0   950            FALSE
Criffel           6.5  1750            FALSE
Acmony            5.0   500            FALSE
Ben Nevis        10.0  4400             TRUE
Knockfarrel       6.0   600            FALSE
Two Breweries    18.0  5200             TRUE
Cockleroi         4.5   850            FALSE
Moffat Chase     20.0  5000             TRUE

What do the data look like now?

  • How can we model \(P(\operatorname{time}>60|\operatorname{dist}, \operatorname{climb})\) ?
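A quick way to look at the recoded data is to plot dist against climb, colouring the points by more_than_60_min. This is a minimal base-R sketch (the workshop’s original plot may have been produced differently):

```r
library(dplyr)

# Recreate the recoded table from the hills data
hills_new <- MASS::hills |>
  mutate(more_than_60_min = time > 60) |>
  select(-time)

# dist vs climb, coloured by whether the winning time exceeded an hour
plot(climb ~ dist, data = hills_new,
     col = ifelse(hills_new$more_than_60_min, "red", "blue"),
     pch = 19,
     main = "Hill races: time > 60 min (red) vs time <= 60 min (blue)")
```

Races that took more than an hour tend to sit towards the high-dist, high-climb corner of the plot, which is what a model of \(P(\operatorname{time}>60|\operatorname{dist}, \operatorname{climb})\) should pick up.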

Logistic regression model

  • The model is

\[g(P(\operatorname{time}>60|\operatorname{dist}, \operatorname{climb})) = a+b\times \operatorname{dist}+c\times \operatorname{climb}\]

  • The right hand side looks just like the linear model from before.
  • But the left hand side involves the probability of \(\operatorname{time}>60\) and also a function \(g()\) called a link function.
  • For logistic regression we choose \[g(p) = \log\,[p/(1-p)].\] This makes sure all the predicted probabilities lie between 0 and 1.
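To see why this works, note that \(g\) (the log-odds, or “logit”) maps probabilities in \((0,1)\) to the whole real line, so its inverse maps any value of the right hand side back to a valid probability. In R the logit and its inverse are available as qlogis() and plogis():

```r
# The link function g(p) = log(p / (1 - p)) is base R's qlogis()
g <- function(p) log(p / (1 - p))
stopifnot(isTRUE(all.equal(g(0.25), qlogis(0.25))))

# Its inverse, plogis(x) = 1 / (1 + exp(-x)), always lies between 0 and 1
plogis(c(-10, 0, 10))  # close to 0, exactly 0.5, close to 1

# Round trip: applying the inverse link recovers the probability
plogis(g(0.3))  # 0.3
```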

Fitting logistic regression model in R

  • Practically, fitting (and working with fitted models) is easy in R.
  • We can use the function glm() (= generalised linear model).
fit <- glm(more_than_60_min ~ dist + climb, family = 'binomial', data = hills_new)
  • … then see details of the fit with
summary(fit)

Call:
glm(formula = more_than_60_min ~ dist + climb, family = "binomial", 
    data = hills_new)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.87594  -0.37295  -0.25784   0.00658   3.13752  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept) -7.1708370  2.5300961  -2.834  0.00459 **
dist         0.6289965  0.4305802   1.461  0.14407   
climb        0.0010547  0.0008638   1.221  0.22207   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.574  on 34  degrees of freedom
Residual deviance: 16.145  on 32  degrees of freedom
AIC: 22.145

Number of Fisher Scoring iterations: 7

Predictions from the fitted model

  • We can also make predictions easily
predict(fit, list(dist = 10, climb = 2000), type="response")
        1 
0.7735595 
  • Does the prediction make sense in view of the plot from before?
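As a check, the same number can be computed by hand: plug dist = 10 and climb = 2000 into the linear predictor using the coefficient estimates printed by summary(fit), then apply the inverse link plogis(). (Because the printed estimates are rounded, the answer only agrees to a few decimal places.)

```r
# Rounded coefficient estimates from summary(fit)
a <- -7.1708370  # (Intercept)
b <-  0.6289965  # dist
c <-  0.0010547  # climb

# Linear predictor for dist = 10, climb = 2000
eta <- a + b * 10 + c * 2000

# The inverse link turns the log-odds into a probability
plogis(eta)  # approximately 0.7736
```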

Things to try

  • Question to investigate: Are some LSOAs associated with a higher probability of a casualty for any given incident?
  • Step 1: work out what we need from the incidents data.
  • Step 2: pull out what we need.
  • … first by adding an any_casualties variable
incidents |> mutate(any_casualties = Fire_Casualties > 0)
  • … then, from the result, selecting only the columns we need (assigning the resulting table to dat).
dat <- incidents |> mutate(any_casualties = Fire_Casualties > 0) |> select(IncidentID, LSOArea, any_casualties)
  • Step 3: fit a logistic regression for \(P(\)at least one casualty\(|\)LSOA\()\).
fit <- glm(any_casualties ~ LSOArea, family = 'binomial', data = dat)
summary(fit)
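The incidents table itself isn’t included in this handout, so here is a runnable sketch of the same pipeline on a small made-up data frame. The column names IncidentID, LSOArea and Fire_Casualties follow the code above; the values (and the three area labels) are invented purely for illustration:

```r
library(dplyr)

set.seed(1)

# Made-up stand-in for the real incidents table
incidents <- tibble(
  IncidentID      = 1:200,
  LSOArea         = sample(c("LSOA_A", "LSOA_B", "LSOA_C"), 200, replace = TRUE),
  Fire_Casualties = rpois(200, lambda = 0.2)
)

# Steps 2 and 3 from above, on the synthetic data
dat <- incidents |>
  mutate(any_casualties = Fire_Casualties > 0) |>
  select(IncidentID, LSOArea, any_casualties)

# LSOArea is categorical, so glm() fits one coefficient per area
# (relative to a baseline level)
fit <- glm(any_casualties ~ LSOArea, family = "binomial", data = dat)

# Predicted P(at least one casualty) for each area
predict(fit, data.frame(LSOArea = c("LSOA_A", "LSOA_B", "LSOA_C")),
        type = "response")
```

Comparing these per-area predicted probabilities (and the coefficient p-values from summary(fit)) is one way to answer the question above.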